
    Rayleigh and Prandtl number scaling in the bulk of Rayleigh-Bénard turbulence

    The Rayleigh (Ra) and Prandtl (Pr) number scaling of the Nusselt number Nu, the Reynolds number Re, the temperature fluctuations, and the kinetic and thermal dissipation rates is studied for (numerical) homogeneous Rayleigh-Bénard turbulence, i.e., Rayleigh-Bénard turbulence with periodic boundary conditions in all directions and a volume forcing of the temperature field by a mean gradient. This system serves as a model system for the bulk of Rayleigh-Bénard flow and therefore as a model for the so-called ``ultimate regime of thermal convection''. With respect to the Ra dependence of Nu and Re we confirm our earlier results \cite{loh03}, which are consistent with the Kraichnan theory \cite{kra62} and the Grossmann-Lohse (GL) theory \cite{gro00,gro01,gro02,gro04}, both of which predict $Nu \sim Ra^{1/2}$ and $Re \sim Ra^{1/2}$. However, the Pr dependence differs between the two theories. Here we show that the numerical data are consistent with the GL theory: $Nu \sim Pr^{1/2}$, $Re \sim Pr^{-1/2}$. For the thermal and kinetic dissipation rates we find $\epsilon_\theta/(\kappa \Delta^{2} L^{-2}) \sim (Re\,Pr)^{0.87}$ and $\epsilon_u/(\nu^{3} L^{-4}) \sim Re^{2.77}$, both also consistent with the GL theory, whereas the temperature fluctuations do not depend on Ra or Pr. Finally, the dynamics of the heat transport is studied and put into the context of a recent theoretical finding by Doering et al. \cite{doe05}. Comment: 8 pages, 9 figures.
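
    For reference, combining the Ra and Pr dependences quoted above gives the compact form of the scaling laws tested in the paper (a restatement of the formulas in the abstract, not an additional result):

        Nu \sim Ra^{1/2}\, Pr^{1/2}, \qquad Re \sim Ra^{1/2}\, Pr^{-1/2},

        \frac{\epsilon_\theta}{\kappa \Delta^{2} L^{-2}} \sim (Re\,Pr)^{0.87}, \qquad \frac{\epsilon_u}{\nu^{3} L^{-4}} \sim Re^{2.77}.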

    The Hypothesis of Superluminal Neutrinos: comparing OPERA with other Data

    The OPERA Collaboration reported evidence for muonic neutrinos traveling slightly faster than light in vacuum. While awaiting further checks from the experimental community, here we explore some theoretical consequences of the hypothesis that muonic neutrinos are superluminal, considering in particular the tachyonic and the Coleman-Glashow cases. We show that a tachyonic interpretation is not only hard to reconcile with the OPERA data on energy dependence, but that it also clashes with neutrino production from pion decay and with neutrino oscillations. A Coleman-Glashow superluminal neutrino beam would also have problems with pion decay kinematics for the OPERA setup; it could easily be reconciled with the SN1987a data, but then it would be very problematic to account for neutrino oscillations. Comment: v1: 10 pages, 2 figures; v2: 12 pages, 2 figures, improved discussion of the CG case as regards pion decay and neutrino oscillations, added reference.
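
    To make the two hypotheses concrete, the textbook dispersion relations usually associated with them read as follows (a sketch in standard notation; the paper's precise conventions may differ, and the symbols $v_*$ and $\delta$ are introduced here only for illustration):

        \text{tachyonic } (m^2 > 0): \quad E^2 = p^2 c^2 - m^2 c^4, \qquad v_g = \frac{\partial E}{\partial p} = \frac{p c^2}{E} > c,

        \text{Coleman-Glashow}: \quad E^2 = p^2 v_*^2 + m^2 v_*^4, \qquad v_* = (1+\delta)\, c, \quad \delta > 0.

    In the tachyonic case the velocity excess grows as energy decreases, while in the Coleman-Glashow case the maximal attainable velocity $v_*$ is energy-independent; this difference is what drives the comparison with the OPERA energy dependence discussed above.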

    Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications

    The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on the Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high (approximately 3 Tflops in double precision and 6 Tflops in single precision), but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid dynamics. In our OpenMP code we consider several memory data layouts that meet the conflicting computing requirements of distinct parts of the application and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.
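
    The conflicting requirements mentioned above come from the memory-bound propagate step, which favors unit-stride streaming, and the compute-bound collide step, which favors keeping all data of a site together. A minimal sketch of the two classic layouts (the D2Q37 population count comes from the lattice Boltzmann literature; sizes and names are illustrative, not the paper's code):

        #define NPOP 37   /* populations per site, e.g. a D2Q37 model */
        #define NX   128  /* illustrative lattice size */
        #define NY   128

        /* AoS: the NPOP populations of one site are contiguous.
           Convenient for collide, but propagate then reads memory
           with stride NPOP, which hurts vectorization. */
        typedef struct { double p[NPOP]; } site_t;
        static site_t lattice_aos[NX * NY];

        /* SoA: one contiguous plane per population index.
           Propagate streams with unit stride; collide must gather
           NPOP far-apart arrays. */
        static double lattice_soa[NPOP][NX * NY];

        /* Access helpers: population ip at site (x, y). */
        #define AOS(x, y, ip) (lattice_aos[(y) * NX + (x)].p[(ip)])
        #define SOA(x, y, ip) (lattice_soa[(ip)][(y) * NX + (x)])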

    Optimization of lattice Boltzmann simulations on heterogeneous computers

    High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads between host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
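
    A simple instance of such a partitioning model splits the lattice so that host and accelerator finish each timestep at the same time; with measured per-device throughputs this reduces to a single ratio. The sketch below is a hand-written illustration of this balance condition, not the model of the paper (rates and names are invented):

        #include <stdio.h>

        /* Balance condition: n_h / r_h = n_a / r_a with n_h + n_a = N,
           hence n_h / N = r_h / (r_h + r_a), where r_h and r_a are the
           measured processing rates (sites/s) of host and accelerator. */
        double host_fraction(double rate_host, double rate_acc) {
            return rate_host / (rate_host + rate_acc);
        }

        int main(void) {
            double f = host_fraction(0.8e9, 3.2e9); /* illustrative rates */
            printf("host share: %.0f%%\n", 100.0 * f); /* prints 20% */
            return 0;
        }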

    Performance and portability of accelerated lattice Boltzmann applications with OpenACC

    An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only in accelerator-specific languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
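
    In practice, the directive-based approach amounts to annotating existing loops and letting the compiler generate device code; a minimal sketch of the idea on a C kernel (the saxpy loop is illustrative and not taken from the paper):

        #include <stdio.h>

        #define N 1000000

        /* y <- a*x + y; the pragma asks the compiler to offload the
           loop and to manage the listed data movements. */
        void saxpy(int n, float a, const float *restrict x,
                   float *restrict y) {
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int main(void) {
            static float x[N], y[N];
            for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
            saxpy(N, 3.0f, x, y);
            printf("y[0] = %f\n", y[0]); /* expect 5.0 */
            return 0;
        }

    Without an OpenACC compiler the pragma is simply ignored and the loop runs serially, which is part of what makes the approach attractive for portability.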

    FFT for the APE Parallel Computer

    We present a parallel FFT algorithm for SIMD systems following the `Transpose Algorithm' approach. The method is based on the assignment of the data field onto a 1-dimensional ring of systolic cells. The systolic array can be universally mapped onto any parallel system. In particular, for systems with next-neighbour connectivity, our method has the potential to improve the efficiency of matrix transposition by use of hyper-systolic communication. We have realized a scalable parallel FFT on the APE100/Quadrics massively parallel computer, where our implementation is part of a 2-dimensional hydrodynamics code for turbulence studies. A possible generalization to 4-dimensional FFT is presented, with QCD applications in mind. Comment: 17 pages, 13 figures, figures included.
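
    The `Transpose Algorithm' mentioned above computes a 2-dimensional FFT as 1-dimensional transforms along locally stored rows, a global transpose, and a second pass of row transforms, so that only the transpose requires communication. A serial sketch of that structure (with a naive O(n^2) DFT standing in for a real 1-D FFT; a production code would also parallelize the transpose, which is where the systolic communication of the paper enters):

        #include <complex.h>
        #include <math.h>

        /* Naive 1-D DFT of a length-n row (stand-in for a real FFT). */
        static void dft_row(const double complex *row,
                            double complex *out, int n) {
            for (int k = 0; k < n; k++) {
                out[k] = 0;
                for (int j = 0; j < n; j++)
                    out[k] += row[j] * cexp(-2.0 * I * M_PI * k * j / n);
            }
        }

        /* 2-D transform of an n x n field a[] using scratch tmp[]:
           row pass -> transpose -> row pass -> transpose back. */
        void fft2d_transpose(double complex *a, double complex *tmp, int n) {
            for (int r = 0; r < n; r++)
                dft_row(a + r * n, tmp + r * n, n);   /* pass 1: rows */
            for (int r = 0; r < n; r++)
                for (int c = 0; c < n; c++)
                    a[c * n + r] = tmp[r * n + c];    /* transpose */
            for (int r = 0; r < n; r++)
                dft_row(a + r * n, tmp + r * n, n);   /* pass 2: rows */
            for (int r = 0; r < n; r++)
                for (int c = 0; c < n; c++)
                    a[c * n + r] = tmp[r * n + c];    /* transpose back */
        }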

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached. Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics.
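
    One idiom that is central to this kind of port is keeping the lattice data resident on the accelerator across many kernel launches, instead of copying it back and forth at every step; a hedged sketch of the pattern (the field and the update kernel are placeholders, not the paper's Monte Carlo kernels):

        #include <stdlib.h>

        #define VOL   (16 * 16 * 16 * 16) /* illustrative 4-D volume */
        #define STEPS 100

        /* Placeholder per-site kernel; 'present' asserts the data is
           already on the device, so no transfer happens per call. */
        static void update(double *restrict field, int n) {
            #pragma acc parallel loop present(field[0:n])
            for (int i = 0; i < n; i++)
                field[i] = 0.5 * field[i] + 0.1;
        }

        int main(void) {
            double *field = malloc(VOL * sizeof *field);
            for (int i = 0; i < VOL; i++) field[i] = 1.0;

            /* One structured data region: copy in once, run all the
               steps on the device, copy out once at the end. */
            #pragma acc data copy(field[0:VOL])
            for (int s = 0; s < STEPS; s++)
                update(field, VOL);

            free(field);
            return 0;
        }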

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on a single GPU and for good scaling behavior on large numbers of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements, identify the corresponding main bottlenecks, and finally compare the results of our GPU code with those measured on other currently available high-performance processors. Our results include a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
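
    A recurring design choice in this class of codes is overlapping node-to-node halo exchanges with computation on the bulk of the local lattice; a condensed sketch of the pattern (the MPI calls are standard, while the kernel names, buffer handling, and sizes are illustrative):

        #include <mpi.h>

        #define HALO 256  /* illustrative border-buffer length */

        static void propagate_bulk(double *lat)    { (void)lat; /* bulk sites */ }
        static void propagate_borders(double *lat) { (void)lat; /* border sites */ }

        void lb_step(double *lat, double *sendbuf, double *recvbuf,
                     int left, int right, MPI_Comm comm) {
            MPI_Request req[2];

            /* 1. start a non-blocking exchange of the border data */
            MPI_Irecv(recvbuf, HALO, MPI_DOUBLE, left,  0, comm, &req[0]);
            MPI_Isend(sendbuf, HALO, MPI_DOUBLE, right, 0, comm, &req[1]);

            /* 2. update the bulk while the messages are in flight */
            propagate_bulk(lat);

            /* 3. wait for the halos, then finish the borders */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            propagate_borders(lat);
        }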

    Portable multi-node LQCD Monte Carlo simulations using OpenACC

    This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization across multiple computing nodes, using OpenACC to manage parallelism within each node and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to maximize performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performance for this application, but also compares with results measured on other processors. Comment: 22 pages, 8 png figures.
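
    When OpenACC manages parallelism within the node and MPI among the nodes, a common pattern is to hand device-resident buffers directly to a GPU-aware MPI library; a hedged sketch (whether the MPI installation accepts device pointers depends on how it was built, and the buffer here is assumed to be already present on the device):

        #include <mpi.h>

        #define N 4096  /* illustrative border-buffer length */

        /* Inside host_data, 'buf' names the device copy, so a
           CUDA-aware MPI can move it without staging on the host. */
        void exchange(double *buf, int peer, MPI_Comm comm) {
            #pragma acc host_data use_device(buf)
            MPI_Sendrecv_replace(buf, N, MPI_DOUBLE, peer, 0,
                                 peer, 0, comm, MPI_STATUS_IGNORE);
        }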